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Abstract 

In this paper we study the effective degrees of freedom of the reduced rank regression 
estimator in the framework of Stein's unbiased risk estimation (SURE). We derive a 
finite-sample exact unbiased estimator of the degrees of freedom for the reduced rank 
regression. We show that it is significantly different from the number of free parameters 
in the model, which is often taken as a heuristic estimate of the degrees of freedom 
for the reduced rank regression. Using the exact unbiased estimator of the degrees of 
freedom, one can easily employ various model selection criteria such as Mallow's C p 
or GCV to efficiently choose an optimal rank for the reduced rank regression problem, 
which often outperforms its heuristic counterpart in terms of prediction accuracy and 
successfully avoids computationally expensive data perturbation or bootstrap based 
methods. We have also extended the proposed approach to other related estimation 
procedures, including the reduced rank ridge regression and a weighted nuclear norm 
penalized multivariate regression. 

Key Words: Degrees of freedom, Model selection, Multivariate regression, Reduced 
rank regression. 



1 Introduction 



Multivariate linear regression is the extension of the classical univariate regression model to 
the case where we have q(> 1) responses and p predictors. It is commonly used in bioinfor- 
matics, chemometrics, econometrics, and other quantitative fields where one is interested in 
predicting several responses simultaneously. 

We can express the multivariate linear regression model in matrix notation as follows. Let 
X denote the n x p predictor or design matrix, with the z-th row Xj = (xa, Xi%, • • • , %i P ) £ 
Similarly the n x q dimensional response matrix is denoted by Y, where the z-th row is 
Yi — (ya,Vi2, ■ ■ ■ )Viq) The regression parameters are given by the coefficient matrix 

B which is of dimension p x q. Note that the fc-th column of B is the regression coefficient 
vector for regressing the fc-th response on the predictors. Let E denote the n x q random 
error matrix with independent entries with mean zero and variance a 2 . Then the multivariate 
linear regression model is given by 

Y = XB + E. (1) 

Note that, this reduces to the classical univariate regression model when q — 1. For nota- 
tional simplicity, we assume that the responses and the predictors are centered, and hence 
the intercept term can be omitted without any loss of generality. The ordinary least squares 
approach of estimating B leads to 



B is = ( xTx r lxTY - 



The ordinary least squares estimate amounts to performing q separate univariate regressions 
and completely ignores the multivariate aspect of the problem, where many of the responses 
might be highly correlated and hence the effective dimensionality can be much smaller than 
q. Quite a large number of methods have been proposed in the literature to overcome these 
drawbacks. Many of them would fall under the general class of linear factor regression, 
where the responses are regressed against a small number of linear combination of predic- 



tors commonly known as factors. Examples include principal component regression (Massy 



1965), partial least squares (Wold, 1975), canonical correlation analysis (Hotelling, 1935) 



and so on. The methods differ in the way they choose the factors. Recently [Witten et al. 



(2009) introduced a penalized canonical correlation analysis using sparse matrix factorization 



that leads to more interpretable factors and is more suitable for high- dimensional problems. 



Breiman and Friedman (1997) proposed the curds and whey(C&W) approach which borrows 



strength by performing a second round of regression of the responses on the ordinary least 
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squares estimators. The authors also show some close connections of the C&W approach 
with canonical correlation analysis. 



Several penalization methods have been proposed in recent years to address the issues of 



prediction performance and variable selection in multivariate regression. Turlach et al. 



(2005) introduced an penalty on the rows of B to encourage simultaneous variable 



selection. Peng et al. (2009) used a combined penalty function of the form, J7"(B) = 
Ai $^i=il|Bj.||i + \i X/i=ill-^i-ll2 to identify "master predictors" in genomics studies. Note 
that, Bj. denotes the j-th row of the coefficient matrix. The first part of the penalty im- 
poses sparsity on entries of B, whereas the second part forces some of the entire rows of B 
to be zero, encouraging the selection of "master predictors" that influence many response 



variables. Obozinski et al. (2011) developed asymptotic theory for the txfli penalized mul- 



tivariate regression problem, which can also be thought of as a special case of the joint 
penalty employed by Peng et al. (2009) with Ai = 0. In particular, they prove that the 



multivariate group lasso penalty recovers the correct row support with high probability in 
high-dimensional settings. 

The methods discussed above do not account for correlated errors. This motivated IRothmanl 



et al. (2009) to propose a joint method for variable selection and covariance estimation. Let 



f2 denote the inverse covariance matrix for the errors. Then the penalty function employed 
by 



Rothman et al. 



(2009) takes the form, J(B,fl) = XiE^Mfl + A 2 ELi ELiM- 



This leads to interpretable models that predict better. Lee and Liu (2012) explore this fur 



ther by introducing two plug-in estimators as well as a joint method that employs weighted 
t\ penalty to estimate (B, Q) simultaneously. In addition they prove that their estimation 
methods are consistent and admit asymptotic distributions. 



Yet another line of research focuses on the rank of the regression coefficient matrix. |Anderson 



(1951) proposed a class of regression models that restrict the rank of the true coefficient 



matrix to be much smaller than the dimensionality of B, i.e. rank(B) < r < min{p, q}. 
This is a quite reasonable assumption in many multivariate regression problems, which can 
be reformulated as follows: the q responses are related to the p predictors only through r 
effective linear factors. It results in the following optimization problem 



B(r) 



argmm 

{B:rank(B)<r} 



|Y - XB\\%. 



(2) 



Even though the rank penalty makes it a non-convex optimization problem, it admits a 



closed form solution as we shall see later. Izenman ( 1975 ) introduced the term reduced rank 
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regression for this class of models and derived the asymptotic distributions and confidence in- 
tervals for reduced rank regression estimators. A non-exhaustive list of notable work includes 



Rao 


(1978 


)■ 


Davies and Tso 


(1982 


), 


Anderson 


(1999 


Izenman 


( 


2008 


) for a more comprehensive account. 



interest in the reduced rank methods. Instead of restricting the rank, Yuan et al. (2007) 
proposed to put an i\ penalty on the singular values of B also known as the nuclear norm. 
This can be considered as the convex relaxation to the reduced rank regression criterion, but 



the resulting optimization problem is quite non-trivial. Bunea et al. (2011) came up with 
an innovative ^-penalized viewpoint of the reduced rank regression. Under that framework 
they are able to characterize the choice of tuning parameter, which guarantees asymptotic 



consistency in terms of rank selection. Chen et al. (2012a) adopted sparsity penalties on sin- 



gular vectors for reduced rank regression problems that lead to more interpretable models. 



Very recently Chen et al. (2012b) proposed a weighted nuclear norm penalty on the signal 
matrix XB; the resulting optimization problem admits a closed form solution, which enjoys 
many desirable theoretical properties. We note that most of the theoretical results obtained 
so far are asymptotic in nature. 



In this paper we study the degrees of freedom of the reduced rank regression model. The 
degrees of freedom is a very familiar and one of the most widely used terms in statistics. 
We utilize it from ANOVA t-tests to model selection criteria such as AIC and BIC. How- 
ever, it has been largely overlooked in the reduced rank regression literature except for some 



heuristic suggestions (Davies and Tso, 1982; Reinsel and Velu, 1998 Chen et al. , 2012b). 



For example, the number of free parameters in a p x q matrix of rank r, given by r(p + q — r) 
has been suggested as a naive estimate of the degrees of freedom of the reduced rank regres- 
sion model restricted to rank r < min{p, q}. In this paper, we aim to find a finite-sample 
unbiased estimator of the degrees of freedom for the reduced rank regression model and 
investigate its properties. The result covers a significant gap in the literature, as the pre- 
viously suggested naive estimate lacks both statistical motivation and practical performance. 



In a nutshell, the degrees of freedom quantifies the complexity of a statistical modeling pro- 



cedure (Hastie and Tibshirani, 1990). In the case of the univariate linear regression model, 
it is well-known that the degrees of freedom is the number of estimated parameters, p. How- 
ever, in general there is no exact correspondence between the degrees of freedom and the 



number of free parameters in the model (Ye, 1998). For example, in the best subset selection 



for univariate regression (Hocking and Leslie, 1967) we search for the best model of size 
Po G {1,2, ... ,p} that minimizes the residual sum of squares. The resulting model has po 
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parameters but intuitively the degrees of freedom would be higher than p$ since the search 



for the "optimal" subset of size po increases model complexity (jHastie et al. , 2009). In other 



words, for best subset selection the optimal p -dimensional subspace that minimizes the 
residual sum of squares clearly depends on y. Thus the final estimator is highly non-linear 
in y, which results in the loss of correspondence between degrees of freedom and the number 
of parameters in the model. 

Similar arguments also apply to the reduced rank regression. Instead of searching for best 
Po-variables as in the case of best subset selection, here we are searching for best go linear 
combinations of the predictors that minimize the least squares loss, which should intuitively 
suggest increased model complexity. Since the optimal rank g _su bspace depends on the re- 
sponse matrix Y, the natural correspondence between number of free parameters and degrees 
of freedom need not hold. This is where reduced rank regression is different from other linear 



factor regression methods, e.g. principal component regression (Massy, 1965). In principal 



component regression, the factors are principal components of the design matrix X, which 
do not depend on the response Y, thus the final estimator is still linear in Y. 

The rest of the paper is organized as follows. In section [2j we review the degrees of freedom 



in the framework of Stein's unbiased risk estimation (Stein, 1981). Section 3 contains our 



proposed estimator of the degrees of freedom of the reduced rank regression model. In section 
[3J we extend the methodology to other related methods, including the reduced rank ridge 
regression and a reduced rank regression model that uses the weighted nuclear norm penalty. 
In section [5j we show that the exact unbiased estimator of the degrees of freedom of the 
reduced rank regression model can be significantly different from the naive estimator through 
several numerical examples. We also show that when using the degrees of freedom derived 
from the previous sections, one can gain prediction accuracy over its heuristic counterpart 
with the reduced rank regression. In section [6j we apply the developed method to a genetic 
association data example, and we conclude the paper with a discussion in section [7j 



2 Degrees of freedom 



Stein (1981) in his theory of unbiased risk estimation (SURE) first introduced a rigorous 



definition of the degrees of freedom of a statistical estimation procedure. Later Efron (2004) 



showed that Stein's treatment can be considered as a special case of a more general notion 
under the assumption of Gaussianity. Assume that we have data of the form (y n xi, X nxp ). 
Given X, the response originates from the following model y ~ (/x, cr 2 I), where \x is the true 
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mean which can be a function of X, and a 2 is the common variance. Then for any estimation 
procedure m(-) with fitted values ji = m(X, y), the degrees of freedom of m(-) is defined as 



d f( m ) = ^2cov(p,i,yi)/a' 



(3) 



i=l 



The rationale is that more complex models would try to fit the data better, and hence 
the covariance between observed and fitted pairs would be higher. This expression is not 
directly observable except for certain simple cases, for example, when m(y) = Sy, a linear 
smoother. In that case, it is not difficult to see that df(m) = tr(S). Stein was able to 
overcome this hurdle for a special case when y ~ N(/j,, a 2 1). Using a simple equality for the 
Gaussian distribution, he proved that as long as the partial derivatives dfii/dyi exists almost 
everywhere for all i £ {1, 2, . . . ,n}, the following holds 



cov(fii,yi 



er E 



dfij 
dyi 



Thus, we have the following unbiased estimator of the degrees of freedom of the fitting 
procedure m(-) 

(4) 

Using the degrees of freedom definition as in Q, |Efam] fl2004| employed the covariance 



penalty approach to prove that the C p -type statistics (Mallow, 1973 ) is an unbiased estimator 
of the true prediction error, where 



cm 



n 



2jm a2 ^ 

n 



This reveals the important role played by the degrees of freedom in model assessment. It 
gives us a principled way of selecting the optimal model without going for computationally 
expensive methods such as cross-validation, and in certain settings it can offer significantly 



better prediction accuracy than such methods (Efron, 2004). Many important works fol 



lowed that of Stein (1981) and Efron (2004). For example, Donoho and Johnstone (1995) 



used the SURE theory to derive the degrees of freedom for the soft-thresholding operator 



in wavelet shrinkage; Meyer and Woodroofe (2000) employed this framework to derive the 



same for shape restricted regression; Li and Zhu (2007) also used this framework to derive 



an unbiased estimator of the degrees of freedom for penalized quantile regression. Zou et al. 



(2007) applied the SURE theory for the popular regression shrinkage and variable selection 



method lasso (Tibshirani, 1996). This is a challenging problem because of the non-linear na- 



ture of lasso solution, which does not admit an analytical solution except for certain special 
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cases. Using sophisticated mathematical analysis, Zou et al. (2007) were able to show that 
the number of non-zero coefficients provides an unbiased estimate of the degrees of freedom 
for the lasso. This is a result of great practical importance since this allows one to come up 
with model selection criteria such as C p and BIC for the lasso without incurring any extra 
computational cost. 



The degrees of freedom for the reduced rank regression model also proves to be a challenging 
problem because of the non-linearity of the estimator. As we will see shortly, even though it 
admits a closed-form solution, the solution is highly non-linear depending on singular value 
decomposition of certain matrices that in turn depends on the response matrix Y. In the 
next section, we study the degrees of freedom of the reduced rank regression model in the 
framework of SURE and propose a finite-sample exactly unbiased estimator. The impor- 



tance of such an estimator has been emphasized repeatedly by Shen and Ye (2002), Efron 



(2004 


), 


Zou et al. 


(2007) 



To overcome the analytical difficulty in computing the degrees of freedom, Ye (1998 ) and Shen 



and Ye (2002) proposed the generalized degrees of freedom approach, where they evaluate 



Q numerically, using data perturbation techniques to compute an approximately unbiased 



estimator of the degrees of freedom. Efron (2004) also proposed a bootstrap based idea to 



arrive at an approximately unbiased estimator of ([3]). Though these kind of simulation based 
approaches allow us to extend the degrees of freedom approach to many highly non-linear 
modeling frameworks, they are also computationally expensive. 



3 Derivation for the reduced rank regression model 

In this section, we propose an analytic method that computes an unbiased estimate of the 
degrees of freedom for the reduced rank regression model. Recall the multivariate linear 
regression model as in Q, which can be rewritten as follows 

vec{Y) = [I q ® X}vec(B) + vec(E), 

nqxl nqxpq pqxl nqxl 

where ® is the usual Kronecker product between matrices and vec(-) is the column-wise 
vectorization operator on a matrix. We start by characterizing the solution for the reduced 
rank regression model. Let Y Q j s be the usual least squares estimate which admits a singular 
value decomposition of the form 

Y olg = X(X T X)- 1 X T Y = UDV T , (5) 
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where U nX n and V gxg are orthogonal matrices that represent the left and right singular 
vectors and D nxq is a diagonal matrix of singular values of Y Q j s . We will denote the fc-th 



column of U and V by w& and Vj. respectively. Using the Eckart- Young theorem (Eckart and 



Young, 1936), it is not difficult to show that the fitted values for the reduced rank regression 



model with the estimated coefficients by (|2]) can be expressed as 

r 



k=l 



where A r denotes the first r-columns of a generic matrix A. Following definition Q, we can 
naturally define the degrees of freedom for the reduced rank regression model as 

where tr(-) denotes the trace operator for a real square matrix. A direct computation of all 
partial derivatives within ([6]), even when being feasible, can be a considerably computational 
burden when the sample size, n, is large. We reframe the question below so that the number 
of partial derivatives to be calculated is of order pq and is not a function of n. 

Define 

F = (X T X) 1 / 2 (X T X)- 1 X T Y = (X T X)- 1 / 2 X T Y, 

where A 1//2 denotes the square root of a symmetric positive definite matrix A. Considering 
the singular value decomposition in (J5j), it is straightforward to see that F admits a singular 
value decomposition of the form 

F = W D V T . 

pxq qxq qxq 

Furthermore, with Y q1s = X(X T X)- 1 / 2 F, we can express the reduced rank regression esti- 
mator as, 

Y(r) = X(X t X)-V'f$>t£ 

k=l 

= X(X T X)- 1/2 W r D r V rT . 
Now using matrix properties such as tr(AB) = tr(BA) and 

vec(ABC) = (C ® A)uec(B), (7) 
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we obtain our unbiased estimator of the degrees of freedom for the reduced rank regression 
model as, 

f dvec(W r B r V rT 



df(r) = tr 
The details are given in the Appendix 



\ dvec(F) 



(8) 



V1~x.P1 



It is clear, by the chain rule of derivatives, that our remaining task is to compute the following 
quantities, 

I f)i)pr("\M\ f)i)f>r(T)} f)i)f>r(\l\ l 

(9) 



J dvec(W) dvec(D) dvec(V) 1 
\ dvec(F) ' dvec{F) ' dvec(F) J ' 



Note that the singular values and vectors of a matrix are not only highly non-linear func- 
tions of the underlying matrix, they are also discontinuous on certain subsets of matrices. 
This implies that degrees of freedom calculation for the reduced rank regression is a very 
challenging problem. There has been a considerable amount of work on the smoothness and 
differentiability of the singular value decomposition of a real matrix in applied mathematics 
literature. The main result of interest to us states - the singular values and singular vectors 
of a real matrix F are smooth different iable functions of the entries of F as long as F has 



full rank and no repeated singular values. Main references include Magnus and Neudecker 



(1998 


), 


O'Neil 


( 


2005 


) and 


de Leeuw 


( 


2007 



steps to complete the task of obtaining an unbiased estimator of the degrees of freedom for 
the reduced rank regression model under the SURE framework: 

1. Derive the partial derivatives in Q over the set of matrices as described above. 

2. Prove that the set where the partial derivatives do not exist has Lebesgue measure 0. 

The following two subsections will address the aforementioned steps respectively and thus 
will complete the derivation. 



3.1 Jacobian of the singular value decomposition 

In this subsection, we propose a method that computes the partial derivatives of the singular 
values and singular vectors of a matrix with respect to the entries of the underlying matrix, 



and we acknowledge that we borrow the idea from Wright (1992) and Papadopoulo and 



Lourakis (2000). We will present the results in the context of a generic p x q matrix F, and 
we will only present the case where p > q; the reverse scenario follows similarly. Assume 
that F admits a singular value decomposition, 



F = W D V T , 

pxp pxq qxq 



such that W T W = WW T = l p and V T V = VV T = I,. Define d k = d kk to be the 
A;-th diagonal element of D; ft 3 w = W T <9W /dfjk, where fj k is the (j,k)th entry of F, 

and equivalently define Ay = ^Tlyj^j = d\ T /dfj^V. Then, by noting W T W = I p and 

VV T = I q , we obtain dW/df jk = WO^ and dV/df jk = -VAf . Consequently, our task 
is equivalent to obtaining fi^, Ay and dd k /dfj k , for j = 1,2, ... ,p; k = 1,2, ... ,q, which 
can be completed by 

(i) Showing 

wj L = WjkVkk, k = l,2,...,q. (10) 

Cjjk 

(ii) Obtaining fl J w (f, k') and Ay(j', k') for j' ^ k',1 < j', k' < q by solving a set of systems 
of linear equations, 

d k 'ft J w (f, k') + dfA J v (f, k') = Wjjiv kk/ 
dj>Q 3 w (J', k') + d k >A 3 y(f, k') = -w jk <v kj > 



^ f ., .a , . A ifc/./ forj" = l,2,...,g-l;A; / = j" + l,...,g. 



(11) 

(iii) Obtaining fi^^'', fc') for g + 1 < j' < p and 1 < k' < q by solving 

d i /n $(/> = ~ w jk'Vkf for j' = q + 1, . . . ,p; k' = 1, . . . , q. (12) 

Note that we can solve for Ay fully, while for il J w we are only able to get a partial solution, 
where the last p x (p — q) block remains undetermined. Even though the partial solution of 
implies that we can only solve for the first q columns of d"W/dfj k , it is sufficient since 
the first q columns of W are enough to determine the singular value decomposition of an 
p x q matrix with p > q. 

We complete this subsection by noting that the items (i)-(iii) above are direct consequences 
of pre-multiplying by W T and post-multiplying by V the following equation 

9F W DV- + w|£v- + WD SVr 



dfjk dfjk dfjk dfjk 

which leads to 



W^V = n#D + |5- + DAy . (13) 



Item (i) and (10) are the result of matching the diagonal entries in (13) and using the fact 



that ft J w and Ay are anti-symmetric matrices; a matrix A is anti-symmetric if A + A T = 0. 



Note that both sides of ( 13 ) are p x q matrices with p > q. Items (ii) and (iii) are obtained 



by matching the upper q x q submatrices and by matching the lower (p — q) x q submatrices 



from the two sides of (13), respectively. 
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3.2 Existence of partial derivatives almost everywhere 



Studying the system of linear equations (11) and (12), it is not difficult to find that the 
solution exists only if, 

di > d 2 > . . . > d q > 0. 
That is, Yqls must be of rank q and it cannot have repeated singular values as dercribed in 



O'Neil (2005). The following theorem gives the main result that allows us to apply Stein's 



framework in deriving the degrees of freedom of the reduced rank regression problem. 

Theorem 3.1. Let M. pxq be the space of all real-valued p x q dimensional matrices equipped 
with the Lebesgue measure fi. Also, let S C W xq denote the subset of matrices that have full 
rank and no repeated singular values. Then 

fi(S) = 1. 

To prove the theorem, we start with a few definitions and facts from algebraic geometry and 
matrix analysis. 

Definition 3.2. An algebraic variety over IR fc (or C k ) is defined as the set of points satisfying 
a system of polynomial equations {fi(xi,x 2 , ■ ■ ■ , x^) = 0; £ G X}. 

Here each fg(-) is a polynomial function of its arguments and X denotes an index set. If at 
least one of the fi(-) ^ 0, then it is called a proper subvariety. Note that a proper subvariety 



must be of dimension less than k and therefore has Lebesgue measure in R (Allman et al. 



2009). For a more detailed discussion, we recommend Hartshorne (1977) or Cox et al. (2007). 



R k ( 


Allman et al. 


Cox et al. 


(2007) 



Proposition 3.3. (Laub, 2004) Any square symmetric matrix M e M fcxfc has at least one 
repeated eigenvalue if and only if rank (M <g> 1^ — <g> M) < (k 2 — k) . 

Now we prove the theorem. First we define 



<Si 

s 2 



{AG 
{AG 



l pxq : A has at least one singular value}, 

i pxq : A has at least one repeated singular value}. 



Note that S c = SiUS 2 , thus it is enough to show that /x(Si) = and fi(S 2 ) = 0. By definition 



3.2| and the discussion above it suffices to show that Si and S 2 are proper subvarieties of 

0}. 



l pxq . Si can be rewritten as follows 

<Si = {A G R pxq : det(A T A) 



Here det(-) denotes the determinant operator for a square matrix. Note that det(A T A) is a 
non-trivial polynomial in entries of A and hence Si is a proper subvariety and has Lebesgue 
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measure 0. For S2 note that if A e W xq has at least one repeated singular value, it implies 
that A T A e W xq has at least one repeated eigenvalue. Then in view of proposition |3.3[ S2 
can be reformulated as 

S 2 = {Ae R pxq : rank (A T A ® I q - I q ® A T A) < (g 2 - g)} . 

This is an algebraic variety since it can be expressed as the solution to all minors of order 
> (q 2 — q) being equal to 0, which are all polynomial equations in the entries of A. Thus, 
we have shown that, fi(Si U S2) = 0. 

3.3 Computational cost 

It is fairly straightforward to derive the computational complexity of the proposed method. 
For each j = 1, 2, . . . , p and k = 1, 2, . . . , q, we need to solve pq many 2x2 systems of linear 
equations. Hence the complexity of the procedure is of 0(p 2 q 2 ) once the initial singular value 
decomposition has been carried out. On the other hand, for the data perturbation method 



(Ye, 1998), we need to compute a singular value decomposition for each i — 1, 2, ... ,n and 
k = 1,2, ... ,q to numerically evaluate the derivatives. The singular value decomposition of 
an x q matrix has a computational complexity of O (ngmax {n, q}). Therefor the data per- 
turbation approach would lead to a complexity of O (n 2 q 2 max {n, q}), which is much higher 
than that of the exact method we propose. In addition, such data perturbation methods 
can often lead to "unstable" estimates. This is because even the state-of-art algorithms for 



computing the singular value decomposition, e.g. LAPACK (Anderson et al. 1994) is not 
very robust with respect to small perturbations. The computational cost of the bootstrap 
method (Efron, 2004) is O (Bnq max {n, q}), which in general is also inferior to 0{p 2 q 2 ). 



4 Extension to other reduced rank methods 

This section deals with the extension of the proposed degrees of freedom technique to two 



other recently proposed rank regularized multivariate regression methods (Mukherjee and 



Zhu 


2011 


Chen et al. 


2012b 



section is quite general and can be applied to several other problems that deal with rank 
regularization or singular value thresholding. This allows us to do a principled model selec- 
tion and avoid the computationally expensive cross-validation method for a wider range of 
problems. 
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4.1 Extension to reduced rank ridge regression 



Mukherjee and Zhu (2011) proposed a modified version of reduced rank regression model, 



which incorporates an additional ridge penalty to counter the issue of collinearity among the 
prediction variables. In particular, they choose to minimize 



B(A,r) 



argmin ||Y 

{B:rank(B)<r} 



XBI 



AIIBI 



F) 



(14) 



where 1 < r < min{p, q} and denotes the Frobenius norm of a matrix. The additional 
ridge penalty is motivated by the fact that it ensures a well-behaved estimator of B even in 
the presence of high-collinearity, whereas the rank penalty encourages dimension reduction. 
Note that this method has two tuning parameters, (A,r). The authors propose a fc-fold 
cross-validation approach to select the optimal model over a grid of tuning parameters. The 
degrees of freedom methodology developed in section [3] can be easily extended here. We 



start by describing the solution to the optimization problem (14). Denote the usual ridge 
regression estimator by 



Y R = X (X T X + AI) 1 X T Y = UDV T , 



where the rightmost part denotes the singular value decomposition. Following similar steps 



as in the case of reduced rank regression, we can show that the solution to ( 14 ) is given by 

Y(A,r) = U r D r V rT . 

Now, using the same chain of arguments as before it is straightforward to deduce that 
the degrees of freedom estimator for the reduced rank ridge regression model under Stein's 
framework is 



df(X,r) 



tr 



tr 



( dvec(Y(X, r)) dveclY 



\ dvec(Y R ) dvec(Y) 
dvec(WWV rT ^ 



dvec(Y R ) 



[I, ® X(X T X + AI^X 1 ] 



(15) 



To compute the first part of (15), i.e., the partial derivatives of singular values and vectors of 



a matrix with respect to the entries of the underlying matrix, we can employ the technique 



developed in subsection 3.1 Once (15) is computed, we can use any of the popular model 



selection criteria such as Mallow's C p , BIC or GCV to select an optimal model over a grid 
of tuning parameter values. 
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4.2 Extension to the adaptive SVD soft-thresholding estimator 



Recently, Chen et al. (2012b) proposed a very interesting method that they call adaptive 



SVD soft-thresholding estimator for multivariate regression. The rank penalty that we have 
discussed in great detail can be considered as an £ penalty on the singular values of B. 



Yuan et al. (2007) proposed a factor estimation and selection procedure, which penalizes 



the i\ norm of the singular values of B (nuclear norm) to encourage sparsity. But the 
resulting optimization problem is rather chellenging and does not admit any closed form 



solution. Chen et al. (2012b) choose to introduce a weighted nuclear norm penalty on the 



signal matrix XB. Specifically, the resulting optimization problem is 



min {p,q} 

B(A) = argmin-||Y-XB||| + A x " 
B 2 



k=l 



WkdkCXB) 



(16) 



where dk(A) denotes the fc-th largest singular value of a generic matrix A. The weights 
Wk are assumed to be non-negative and non-decreasing to ensure more severe penalization 
for smaller singular values. The advantages of this choice of penalty function are two-fold. 
Firstly, it focuses directly on the singular values of the signal matrix XB instead of the 
coefficient matrix B, which should lead to better prediction performance especially when X 



is highly collinear. Secondly, this leads to a closed form solution to (16). Before getting into 



the issue of model complexity, we first describe the solution to the optimization problem 



(16). We will continue to use the notation for the least squares estimator Y Q | S described in 
([5]). Define the adaptive soft-thresholding operator for a diagonal matrix A mxm as 

S\ W (A) = Diag{(A fc - \w k ) + : k = 1, 2, ... , m}, 



where (•)+ denotes the positive part. The authors show that the estimated response matrix 
for the proposed method is given by 



Y(A) = U<S Aw (D)V J 



Chen et al. (2012b) utilizes this closed form solution to prove desirable theoretical properties 



such as rank consistency and error bounds. But for practical purposes we still need a way 
of efficiently choosing the tuning parameter A. The authors propose two ways of doing that: 
the first one is based on computationally intensive cross-validation, and the second one is 
generalized cross-validation (GCV). For the GCV approach they choose to use the number 



of free parameters heuristic (Davies and Tso, 1982; Reinsel and Velu 1998). 
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It turns out that the degrees of freedom technique for rank regularization can also be extended 



to this type of weighted i\ penalty on singular values. Specifically, as done in (15), we have 

\dvec(Y oh ) dvec(Y) J 

= J ^«!) M X(X^)--xi (17) 
V dvec ( Y oh) J 



Once again, it is important to note that in order to obtain the estimator (17), it suffices 
to compute the partial derivatives of the singular values and singular vectors with respect 
to the underlying matrix, Y Q j s , which is what we are able to compute using the techniques 



developed in subsection 3.1 



5 Simulation studies 

In this section, we evaluate the performance of the proposed method by simulation studies. 
Specifically, we aim to demonstrate two things: 1) the exact unbiased estimator of the 
degrees of freedom for the reduced rank regression is in general significantly higher than the 
naive estimator; 2) using the exact estimator of the degrees of freedom enables us to gain 
prediction accuracy over the naive estimator. 



5.1 Unbiasedness 

In this simulation, we aim to show that the degrees of freedom estimator defined via ([8| is 
unbiased and it can be significantly higher than the naive estimator that simply counts the 
number of free parameters. Here unbiasedness is defined over the error distribution, and we 
treat X as a fixed design matrix. Parameters of the study are as follows, n = 200, p — 15 
and q = 12. Let S denote the covariance matrix of the predictor variables, X. We consider 
a low level of correlation among X, namely Xjy = 0.3' J— J Rows of the predictor matrix 
are generated independently from N p (0, S). To control the singular structure of B through 
the covariance of signals XB, B T SB, we take the left singular vectors of B the same as the 
eigenvectors of S , whereas the right singular vectors of B are generated by orthogonalizing a 
random standard normal matrix. The difference between successive non-zero singular value 
of B is fixed at 2. The error matrix is generated from i.i.d. standard normal distribution. We 
replicate the process 500 times; note that the design matrix remains fixed. We compare the 



proposed exact method against the data perturbation technique (Ye, 1998) and the paramet- 



ric bootstrap (Efron, 2004). For the data perturbation method, we consider 25 perturbations 
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of the response matrix for each replication to estimate the partial derivatives numerically. 
We used the choice of 0.1a for the perturbation size, where a is the error standard deviation. 
For the parametric bootstrap approach, we choose B = 50 bootstrap estimates for each 
replication to compute the covariance estimator using (§. The mean of the bootstrapped 
covariance estimator over the replications can be thought of as a Monte-Carlo estimator of 
the true degrees of freedom. This will be denoted by df(r). Ideally we would expect the 
proposed exact estimator to be fairly close to df(r) on average. We compare estimators 
against the naive degrees of freedom estimate namely, df n (r) = r(p + q — r), which denotes 
the number of free parameters in a p x q matrix of rank r. Note that the naive estimator 
does not depend on the data. 

First we consider the situation where the true underlying B has full rank. On the left of 
Figure [T] we see that out of the four estimators, i.e. the proposed exact method, the data 
perturbation estimator, the parametric bootstrap estimator and the naive estimator, the 
averages of the first three are almost indistinguishable to the naked eye and they are sig- 
nificantly higher (often 20% or 30% higher) than the of the naive estimator, which counts 
the number of free parameters. The figure on the right side allows us to get a sense of the 
variability of the estimation procedures. Clearly the standard error for the exact method is 
orders of magnitudes smaller than that of either data perturbation or parametric bootstrap 
estimators. 



Next we change the rank of B to 8 and repeat the same experiment. Once again we find in 
the panel on the left side of Figure [2] that the averages resulted by the first three estimation 
methods nearly overlap, and they are consistently higher than the naive estimate. The 
figure also reveals an interesting pattern, that is, the exact estimators seem to match the 
naive estimator very closely at the true rank, i.e. 8 in this case. We could not explain this 
behaviour theoretically at this point but this was observed in a wide variety of simulation 
studies. In addition, we find in the figure on the right that the standard error of the exact 
degrees of freedom becomes drastically higher once we go above the true underlying rank. 
This arises from the fact that once we go above the true rank, the singular values of Y Q j s 
basically correspond to noise, and can be very close to each other. Hence slight perturbations 
of the data might lead to different singular directions being selected, which implies higher 
variability in model complexity. This has also been noted by Ye (1998), that is, if we are 
trying to fit pure error components, the degrees of freedom tends to be higher and unstable. 
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Figure 1: B is full rank. Left: the average degrees of freedom estimate over 500 replications; 
Right: the standard error of the degrees of freedom estimate over 500 replications. 
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Figure 2: B has rank 8. Left: average degrees of freedom estimate over 500 replications; 
Right: the standard error of the degrees of freedom estimate over 500 replications. 
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5.2 Prediction performance 

The previous set of simulations have shown that the exact degrees of freedom estimator can 
be significantly different from the number of free parameters estimator. Degrees of freedom 
estimates are commonly used in various model selection criteria. In this subsection, we 
aim to show that for reduced rank regression, we can gain in prediction accuracy by using 
the exact degrees of freedom estimator in a model selection criterion instead of the naive 



estimator. Since our focus is on prediction accuracy, we consider Mallow's Cp (Mallow, 1973) 



and generalized cross- validation(GCV) QGolub et al.j [1979] as our model selection criteria. 



GCV has the added advantage that it does not require an estimate for the error variance 
and hence proven to be much popular in practice. In the context of reduced rank regression, 
they are defined as follows 

C p (r) = l(||Y-Y>)||| + 2a 2 #(r)), 

GCV(r) = HgjllMB 

We select the model that minimizes the respective criterion over 1 < r < min{p, q}. For this 
study, we choose p = 12, q — 10 and fix the true rank of B at 3. We consider two different 
sample sizes n = 50 and 200 as well as two levels for error variance, namely, a 2 = 1 and 2. 
Correlation among predictor variables is kept at a moderate level of 0.5. We consider two 
specific structures for the singular values of B 

Model I : tr(B) = {2.5, 1.0, 0.4, 0, ... 0} =>• ct(B t SB) = {17.23, 6.21, 0.26, 0, ... 0} , 
Model II : a(B) = {2.5, 2.0, 1.5, 0, ... 0} ^ a(B T SB) = {17.23, 8.82, 3.70, 0, ... 0} . 

From Model I to Model II, the non-zero singular values of B increase, which in turn implies 
that the non-zero singular values of B T SB increase, thus making the problem easier both 
in terms of signal to noise ratio (SNR) and the gap between the smallest non-zero singular 
value and 0. The latter was shown to play an important role in rank recovery and prediction 



performance by Bunea et al. (2011). Data is generated using the same procedure as in the 
previous study except that X is no longer fixed. We compare the prediction performance 
between several model selection criteria by the model error, which is defined as 

ME = tr j(B - B) T S(B - B)| . 



We repeat the simulation for 100 times at each combination of model, sample size and error 
variance. Recall that we will be comparing four model selection criteria, namely, C p with 
the exact degrees of freedom (C p (e)), C p with the naive degrees of freedom (C p (n)), GCV 
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with the exact degrees of freedom (GCV(e)) and GCV with the naive degrees of freedom 
(GCV(n)). We fit the optimal model based on each of the competing model selection criteria 
and compute the model error. Table [T] summarizes the results. 

Table 1: Prediction accuracy comparison of different model selection criteria, namely, C p (e), 
C p (n), GCV(e), GCV(n) 



Sample Size 


n = 50 


Error Var Model 


C p (e) C p (n) GCV(e) GCV(n) 


o> = i 1 

II 


1.408(0.31) 1.543(0.39) 1.393(0.31) 1.523(0.39) 
1.541(0.41) 1.636(0.56) 1.541(0.41) 1.627(0.54) 


a> = 2 ' 

II 


2.643(0.70) 2.857(0.83) 2.631(0.69) 2.831(0.83) 
3.231(0.89) 3.421(1.17) 3.231(0.89) 3.386(1.09) 


Sample Size 


n = 200 


Error Var Model 


C p (e) C p (n) GCV(e) GCV(n) 


o> = i 1 

II 


0.344(0.10) 0.361(0.11) 0.344(0.10) 0.362(0.11) 
0.310(0.07) 0.343(0.10) 0.310(0.07) 0.343(0.10) 


a> = 2 ' 

II 


0.708(0.16) 0.724(0.21) 0.708(0.16) 0.724(0.21) 
0.605(0.12) 0.650(0.17) 0.605(0.12) 0.650(0.17) 



We find that for both C p and GCV, using the proposed exact degrees of freedom estimator 
performs better in terms of prediction accuracy than its naive counterpart. It has lower av- 
erage prediction error as well as smaller standard error in all of the combinations of model, 
sample size and error variance. The relative gain is larger when the sample size is small and 
the smallest non-zero singular value of the limiting signal matrix is closer to 0. To better 
understand the advantage of using the exact degrees of freedom in a model selection crite- 
rion, we have also summarized in Table [2] the frequency of the rank selection among the 100 
replications at each combination of model and sample size when a 2 = 1. We have skipped 
the other case (a 2 = 2) for brevity. 

First focus on the combination of Model I and small sample size, i.e. n = 50. Recall that the 
smallest non-zero singular value of B T SB for Model I is rather small, i.e. 0.26, implying the 
effective rank for prediction purpose is 2 instead of 3. Interestingly we find that the C p and 
GCV criteria with the exact estimator for the degrees of freedom select the rank 2 model 
nearly 90% of times. On the other hand, the same criteria equipped with the naive degrees 
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Table 2: Comparison of rank selection performance of the competing model selection criteria, 
C p (e), Cp(n), GCV(e) and GCV(n). 



Sample Size 


Rank 




M 


odel I 




C p (e) 


C p (n) 


GCV(e) 


GCV(n) 




2 


88 


62 


91 


65 


n = 50 


3 

4 


12 




37 

1 


9 




35 






5 
















2 


4 


1 


4 


1 


n = 200 


3 

4 


89 

5 


80 

17 


89 

5 


79 

18 




5 


2 


2 


2 


2 


Sample Size 


Rank 


Model II 


Cp(e) 


C» 


GCV(e) 


GCV(n) 




2 














n = 200 


3 

4 


95 

5 


85 

14 


95 

5 


85 

15 




5 





1 










2 














n = 200 


3 

4 


98 

2 


81 

17 


98 

2 


81 

17 




5 





2 





2 



of freedom estimator has a more even distribution, approximately 65% - 35% of selecting the 
rank 2 versus the rank 3 model, but it performs sub-optimally to the exact estimator in terms 
of prediction accuracy. This is due to the bias-variance trade-off in the mean squared error. 
The third largest singular direction contains very little explanatory power, thus a simpler 
model is able to attain better prediction accuracy especially for small sample sizes. When 
we move to higher sample size of n — 200, we find that the rank selection performances of 
both approaches improve but the proposed method still does better than the naive approach 
in terms of rank selection. In Model II, the signal strength and the smallest non-zero sin- 
gular value are both higher, leading to a much better rank selection performance even at 
the smaller sample size, but once again, the model selection criteria with the exact degrees 
of freedom do better. Overall it appears that the naive estimator tends to overestimate the 
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rank slightly, which is expected since it underestimates the degrees of freedom of the reduced 
rank regression model. 



6 Analysis of arabidopsis thaliana data 

In this section, we apply the proposed method to a genetic association data set that was 



published in Wille et al. (2004). This is a microarray experiment aimed at understanding the 
regulatory control mechanisms between the isoprenoid gene network in Arabidopsis thaliana 
plant (more commonly known as thale cress or mouse-ear cress). It is known that isoprenoids 
serve many important biochemical functions in plants. To monitor the gene-expression lev- 
els, 118 GeneChip microarray experiments were carried out. The predictors consist of 39 
genes from two isoprenoid bio-synthesis pathways namely MVA and MEP, whereas the re- 
sponses consist of gene-expression of 795 genes from 56 metabolic pathways, many of which 
are downstream of the two pathways considered as predictors. Thus some of the responses 
are expected to show significant associations to the predictor genes. To facilitate it further, 
we select two downstream pathways namely, Caroteniod and Phytosterol as our responses. 
It has already been proven experimentally that the Carotenoid pathway is strongly attached 
to the MEP pathway, whereas the Phytosterol pathway is significantly related to the MVA 



pathway. See Wille et al. (2004) and the references therein for a more detailed discussion 
on the biological aspects. Finally we have 118 observations on p = 39 predictors and q = 36 
responses. All the predictors and responses are log-transformed to reduce the skewness of 
the data. We also standardize the responses in order to make them comparable. 



We split the data set randomly into training and test sets of equal size 
using the training samples and then we use it to predict on the test set. 
measure under consideration is the usual mean squared prediction error 

MSPE= A||Y test -Y test ||^. 

/ i(J 

Here Y denotes the predicted value and \\-\\f denotes the Frobenius norm of a matrix. The 
entire process is repeated 100 times based on random splits to ensure that the results remain 
robust to the process of splitting. Model selection methods under consideration are Mal- 
low's C p , GCV and BIC with the exact degrees of freedom as in ^ and the naive degrees 
of freedom which counts the number of free parameters in a matrix of restricted rank. 



The model is fit 
The performance 
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Figure 3: Boxplot of mean square prediction error of each method over 100 random splits. 

The mean squared prediction errors for each method are summarized using the boxplot in 
Figure [3] As we can see, for all three model selection criteria considered, the use of the exact 
unbiased estimator enables us to outperform the one which uses the heuristic estimator in 
terms of prediction accuracy. 

Table 3: Prediction accuracy and rank selection performance for the competing methods on 
the Arabidopsis thaliana data. 





Cp(e) 


Gp(n) 


GCV(e) 


GCV(n) 


BIC(e) 


BIC(n) 


OLS 


Avg(Pred Err) 


2.197 


2.243 


2.192 


2.282 


1.297 


1.387 


2.589 


Std(Pred Err) 


0.250 


0.246 


0.248 


0.246 


0.134 


0.201 


0.282 


Mean(Est Rank) 


8.760 


9.710 


8.680 


10.520 


1.090 


1.480 





Table [3] shows the average prediction error over the 100 random splits as well as its standard 
deviation. This also reflects that the model selection criteria with the exact unbiased estimate 
of the degrees of freedom tends to select smaller rank solutions than that corresponding to 
the heuristic estimate of number of free parameters. This is to be expected since the exact 
unbiased estimate is usually higher than the number of free parameters and thus penalizes 
more severely on models with higher compexity. 




7 Concluding remarks 

We have proposed an exact unbiased estimator of the degrees of freedom for the reduced 
rank regression model in the framework of SURE. Through simulations and an Arabidop- 
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sis thaliana data example, we show that the exact unbiased estimator can be significantly 
different than the heuristic estimator that simply counts the number of free parameters. 



The proposed estimator matches with the numerical estimators due to Ye (1998) and Efron 



(2004), which further verifies the validity of our method. It gives us a principled way of 
using various model selection criteria, such as, C p , GCV and BIC to select an optimal rank 
solution. The use of the proposed estimator often outperforms that of the heuristic coun- 
terpart. The new degrees of freedom estimator can be computed easily by solving a set of 
systems of linear equations without incurring much computational cost. This feature and 
the potentially inflated variation are the main drawback for data perturbation or parametric 
bootstrap estimators. The new estimator improves both aspects. We have also shown that 
the methods develpoed here are quite general and can be extended to other related estima- 
tion procedures that employ regularization of the singular values, such as the reduced rank 
ridge regression (Mukherjee and Zhu, 2011) and a weighted nuclear norm penalized reduced 
rank regression (Chen et al. 2012b). The nuclear norm regularized multivariate regression 
has attracted much attention in recent years, e.g. Yuan et al. (2007), Bunea et al. (2011) 
and Chen et al. (2012b). The resulting optimization problems are often solved via iterative 
algorithms and do not have closed form solutions except for certain special cases such as the 
orthogonal design matrix X. It is our future plan to extend the proposed degrees of freedom 
approach to the nuclear norm penalization, which is certainly a challenging problem. 
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Appendix 



Derivation of ^ 

Using simple matrix calculus, trace identity tr(AB) = tr(BA) and ([7]), we get the following 
simplified expression of (pi): 
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tr 
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